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Abstract. The discovery of nearly 200 extrasolar planets during the last 
decade has revitalized scientific interest in the physics of planet formation and 
ushered in a new era for astronomy. Astronomers searching for the small sig- 
nals induced by planets inevitably face significant statistical challenges. For 
example, radial velocity (RV) planet searches (that have discovered most of the 
known planets) are increasingly finding planets with small velocity amplitudes, 
with long orbital periods, or in multiple planet systems. Bayesian inference has 
the potential to improve the interpretation of existing observations, the planning 
of future observations and ultimately inferences concerning the overall popula- 
tion of planets. The main obstacle to applying Bayesian inference to extrasolar 
planet searches is the need to develop computationally efficient algorithms for 
calculating integrals over high-dimensional parameter spaces. In recent years, 
the refinement of Markov chain Monte Carlo (MCMC) algorithms has made it 
practical to accurately characterize orbital parameters and their uncertainties 
from RV observations of single-planet and weakly interacting multiple-planet 
systems. 

Unfortunately, MCMC is not sufficient for Bayesian model selection, i.e., 
comparing the marginal posterior probability of models with different parame- 
ters, as is necessary to determine how strongly the observational data favor a 
model with n -I- 1 planets over a model with just n planets. Many of the obvious 
estimators for the marginal posterior probability suffer from poor convergence 
properties. We compare several estimators of the marginal likelihood and fea- 
ture those that display desirable convergence properties based on the analysis 
of a sample data set for HD 88133b Fischer et al. (2005). We find that methods 
based on importance sampling are most efficient, provided that a good analytic 
approximation of the posterior probability distribution is available. We present 
a simple algorithm for using a sample from the posterior to construct a mixture 
distribution that approximates the posterior and can be used for importance 
sampling and Bayesian model selection. We conclude with some suggestions for 
the development and refinement of computationally efficient and robust estima- 
tors of marginal posterior probabilities. 



1. Introduction 

Recent collaboration between astronomers and statisticians has led to a better 
understanding of the particular challenges associated with Bayesian analysis of 
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dynamical planet detections. In this paper, we briefly review the state of the 
art of Bayesian parameter estimation, model selection, and experimental design 
in the context of extrasolar planet searches. Then, we discuss recent work on 
Bayesian model selection, demonstrating the properties of several estimators of 
the marginal posterior probability using an actual set of data from an RV planet 
search. 



1.1. Observational Data 

In RV surveys, the velocity of the central star is precisely monitored for periodic 
variations which could be caused by orbiting companions (see Fig. 1, left). Each 
individual observation can be reduced to an estimate of the observational uncer- 
tainty ((Tfc) and a measurement of the star's RV {vk) relative to the jkth. velocity 
reference at time t^- Because each RV measurement is based on calculating the 
centroid of thousands of spectra lines and averaged over hundreds of sections 
of the spectrum, the observational uncertainties of most current echcUc based 
RV s urveys can be accurately estimated and are nearly Gaussian ( Butler et al.l 
There may also be intrinsic stellar variability ("jitter") that we model as 
an addition source of uncorrelated Gaussian noise with variance and add to 
the measurement uncertainties in quadrature. If the velocity observations (v^) 

were generated by the model specified by M and model parameters 6, then the 
probability of measuring the observed velocities is 



m = p{v\e, M) = l[ . exp 



2 + s^) 

assuming that the errors in individual observations and the "jitter" are both 
normally distributed and uncorrelated. 

1.2. Theoretical Model 

Specifying the mass and six phase space coordinates of each body in a plane- 
tary system at a specified time provides a complete description of the system. 
In practice, it is convenient to choose the osculating Keplerian orbital elements 
(orbital period, P, orbital eccentricity, e, inclination relative to the plane of the 
sky, i, argument of periastron measured from the plane of the sky, uj, longitude of 
ascending node, O, and mean anomaly, M) for each planet in Jacobi coordinates, 
since the mean anomaly is the only one of these orbital elements that changes 
with time for a planet on an unperturbed Keplerian orbit. The observed stellar 
velocity is the sum of the line of sight velocity of the center-of-mass and the pro- 
jection of the reflex velocity due to any planetary companions onto the line of 
sight. For multiple planet systems, it can be important to use complete n-body 
simulations to model the planetary motions accurately (e.g., GJ876; Laughlin 
et al. 2005; Rivera et al. 2005). However, in most cases, the mutual planetary 
perturbations are negligible on time scales comparable to the duration of obser- 
vations. In such cases, the RV perturbations due to a multiple planet system 
can be modeled as the linear superposition of multiple non-interacting Keplerian 
orbits, vg{t,j) = Cj + J2p^'Vpi't), where Cj is the jth velocity reference. While 
there is a single mean line of sight velocity of the center of motion, it is important 
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Figure 1. Left: The 17 RV observations of HD 88133 published in 
iFischer et al.l l)2005|) plotted versus the time of observation. Right: The log 
posterior probability marginalized over all model parameters except the or- 
bital period, assuming a single planet on a circular orbit (i.e., the Bayesian 
generalization of the periodogram) and the observational data shown on the 
left. 

to use separate constants, Cj for each observatory/spectrograph pair, since the 
velocities are measured differentially relative to a reference velocity that is unique 
to each observatory. The perturbation to the stellar RV {^Vp) due to a planet 
on a Keplerian orbit is given by Avp{t) = Kp [cos {ujp + Tp) + Cp cos(u;p)] where 
p labels the planet, K is the velocity semi-amplitude and T is the true anomaly, 
which implicitly depends on time. The true anomaly (T) is related to the eccen- 
tric anomaly {E) via the relation tan (r/2) = (1 + e)^/^^! _ g)-i/2 ^^^^^1/2 (^e/2). 
The eccentric anomaly is related to the mean anomaly (M) via Kepler's equa- 
tion E{t) - e sin {E{t)) = M{t) - Mo = 27rt/P - Mo where Mo is a constant, the 
orbital phase at t = 0. Unfortunately, RV observations alone are not sensitive 
to the orbital inclination relative to the plane of the sky (i) or the longitude of 
ascending node (fi). Therefore, RV observations by themselves can only mea- 
sure the minimum mass ratio, m^[^/M^ = msini/M*, that is a function of K, 
P, e, and M*, where M^, is the stellar mass, typically estimated by independent 
astronomical observations. 

1.3. Bayesian Framework 

To quantitatively analyze the available observational constraints, we employ the 
techniques of Bayesian inference. By treating both the observation and the 
model parameters as random variables, Bayesian inference is able to address 
statistical questions in a mathematically rigorous fashion. The joint probability, 
p{v, 0\^A), can be expressed as the product of the likelihood {L{6) = p{v\9, M.)), 
the probability of the observables given the model parameters) , and a prior prob- 
ability distribution function {p{9\A4)) which is based on previous knowledge of 
the model parameters. Note that each of the probability distribution functions 
(PDFs) is conditioned on the assumption of a model, A4, that includes the mean- 
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ing of the model parameters, 6, and their relationship to the observational data, 
V. Bayes's theorem allows one to compute a posterior probability density func- 
tion, p(9\v,M), which incorporates the knowledge gained by the observations 

pie^v^M) = ^mm^.. (2) 
jpie\M)piv\e,M)de 

This paper is particularly interested in the case when there are multiple viable 
models (e.g., no planet model, one planet model, two planet model, etc.) for 
the current data set. In this case, the posterior for a given model and set of 
parameters is given by p{9,^A\v) = p{M.)p{^A\v)p{9\v,M.), where p{Ai) is the 
prior probability of model A4 and the marginal posterior probability for model 
M. is given by 

m{v) = p{M\v) = I p{9\M)p{v\e, M) d9. (3) 



We briefly review recent progress in sampling from p{9\v, Al) via MCMC in §1.5 
and introduce and compare algorithms for evaluating of m{v) in §2. 

1.4. Choice of Priors 

In Tabled we list the priors used for each model parameter in this work. While 
the choice of these and other parameters can be fine-tuned for a given problem, 
we suggest these choices as a starting point for the Bayesian analysis of radial 
velocity data sets. Physical and geometric considerations lead to natural choices 
for the prior PDFs for most of the model parameters (see Table A few of the 
priors merit closer attention. The cutoff at a minimum orbital period is chosen 
to be less than the smallest orbital period of known extrasolar planets, but this is 
somewhat larger than the theoretical limit (the Roche limit occurs at ~ 0.2d for 
a lOMjup planet around a IMq star). The cutoff at a maximum orbital period 
is much longer than any known extrasolar planet, but is chosen to be roughly 
where perturbations from passing stars and the galactic tide would disrupt the 
planet's orbit. The cutoff at a maximum velocity semi-amplitude is chosen to be 
a function of orbital period, so that the cutoff corresponds to a constant planet- 
star mass ratio. In this paper, we set i^max = 2129m/s, which corresponds to 
a maximum planet-star mass ratio of 0.01. This choice is primarily based on 
the observed distribution of extrasolar planet masses. Clearly, there are stellar 
binaries with much larger velocity amplitudes, but this limit can be considered 
the definition of a planet. While the possibility of arbitrarily small masses 
(and hence velocity amplitudes) prevents a physical justification for a lower 
cutoff, i^^miri) it is not possible to detect or constrain the orbital parameters of 
a sufficiently low-mass planet. To keep the prior distribution normalizable, we 
impose breaks in the priors for K and s at Kq and sq, chosen to be at a velocity 
amplitude less than the smallest detectable velocity amplitude. For a data set 
with A'obs RV measurements with typical measurement precision of a, we suggest 
setting Kq < ay^BO/No^s (Cumming 1999). For the reference prior, we chose 
So = Kq = Im/s, somewhat a rbitrarily, but motivat ed by the current state of 
the art in RV planet searches ((Asher Johnson et alJ [?006). For systems where 
a planet is clearly detected, the shape of the posterior will not be sensitive to 
our assumptions about the prior for small values of K, but for planets which 
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are marginally detected, this choice may become significant. If the posterior 
distribution has significant probability near K ~ Kq, then one should check how 
sensitive any conclusions are to the choice of Kp. 



Table 1. SAMSI Exoplanet Working Group Reference Priors 



Model Parameter 



Variable 



Prior Distribution 



Minimum 



Maximum 



Amplitude of jitter 



(s+sq) 



log ; 



m/s 



Velocity offset 

Orbital period 
Velocity semi-amplitude 

Orbital eccentricity 
Argument of periastron 
Orbital phase 



Parameters for each velocity reference 

Co 7 



M,, 



Parameters for each planet 



los 



log(P„ax/Pmi„) 



1. + - 



1/3 



1 
J_ 

2-K 



I d 

m/s 







10^ yr 



1 

277 



1.5. Previous Research 

Identifying the correct orbital solution from a set of RV (or other dynamical) 
observations is challenging due to the necessity of considering a very large pa- 
rameter space of possible solutions (see Fig. 1, right). For RV planet searches, 
there are at least five model parameters per planet and one model parameter 
per observatory. To make the global search problem tractable, RV data sets 
are traditionally searched for sinusoidal signals (potential planets) using a pe- 
riodogram. The advantage of the periodogram is that it is extremely tractable 
computationally (0(n log n)) and can be quite useful for identifyii rg potential 
orbit al periods. Then Levenberg-Marquardt (LM) minimization ( Press et al.l 



^Q^) is applied starting from several initial guesses of orbital solutions near 
each of the potential orbital periods identified by the periodogram. If the qual- 
itv of the fit is cons istent with the combination of the observation al unc ertainties 
( Butler et al.lll99^ and the expected intrinsic stellar variability ( Wright. .20051) . 



then estimates of the uncertainties in orbital parameters are obtained by repeat- 
edly finding the best-fit orbital parameters (with LM) to several synthetic data 
sets generated via bootstrap resampling (with replacement) of the observational 
data (Press et al. 1992). These methods work quite well for analyzing stars with 
a single planet on a low-eccentricity, short-period (relative to the duration of the 
observations) orbit when the velocity perturbation is large (relative to the mea- 
surement uncertainties). Since such planets are the easiest to discover, they are 
common among the known sample of planets, and the traditional frequentist 
methods have proven quite valuable in their discovery. 
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In recent years, RV searches have become increasingly sensitive to planets 
with small velocity amplitudes and/or long orbital periods, as well as plan- 
ets in multiple planet systems. Bayesian inference can help with each of these 
challenges and has the potential to significantly improve the sensitivity of detec- 
tions and accuracy of orbital determinations. Recent work has begun to develop 
the framework a nd computational tools to make this happen. For example, 
ICummiii3 ( 20041 ^ discussed the relationship between the periodogram m ethod 
and a B ayesian analysis that assumes any planet is on a circular orbit. iFordI 
(20063) combined brute force Monte Carlo (to integrate over orbital period) 
and the Laplace approximation (to integrate over the remaining model parame- 
ters) to render Bayesian model selection practical for planets assumed to be on a 
circular orbit (e.g., short-period planets prone to tidal circularization) . MCMC 
has been applied to estimate orbital parameters and their uncertainty and to 
help understand the situations where traditional frequenti st metho ds leave sig- 
nificant room for improvement ( Ford 112005. : .Driscol .200^ 1 . iGregor 1 (l2005a ) and 
iFordI ((2006al ) hav e automated the determination of parameters of the candidate 
transition PDFs. iFordI (j2006al 'l has also identified non-linear candidate transition 
PDFs that dramatically accelerate the convergence of MCMC. These advances 
make it computationally feasible for MCMC to characterize the allowed orbital 
solutions for single planets o r weakly - interac ting multiple planet systems (e.g. 
iFord. Lvst ad fc R,asicl l2005l V [G regorvl (|2005bl ^ has applied parallel tempering to 
allow MCMC to explore multiple orbital solutions widely separated in parameter 
space. ((Loreddl2004l ) developed the theoretical framework for applying Bayesian 
adaptiv e experimental design to dynamical extrasolar planet searches, and iFordI 
( 2006bl l developed computationally practical algorithms for applying these tech- 
niques to adaptively schedule radial velocity observations. Bayesian techniques 
are jus t beginning to be applied to analyze the population of extrasolar planets 
( Ford R,asiQ.200a : .LoredQ.2006.) . 



2. Algorithms for Applying Bayesian Model Selection To Extrasolar 
Planets 

While MCMC techniques have proven very efficient for sampli ng from the poste- 
rior distribution for orbital parameters of extrasolar planets ()Fordll2006al . e.g.), 
MCMC does not directly determine the normalizing constant of the posterior 
distribution. While this is not necessary for parameter estimation (within a 
single model), it is essential when co nsidering mu ltiple possible models (e.g., no 
planet, one planet, two planets...). IClvdd (|2006t ) reviews the state of modern 
techniques for Bayesian model selection from a statistics perspective. In this 
paper, we introduce several estimators of the marginal posterior and test their 
performance on the radial velocities for HD 88133 published in iFischer et al.l 
mOh) . While our test data set consists of purely RV observations, we expect 
that most of our findings are also directly applicable to other dynamical planet 
searches (e.g., astrometric, pulsar/white dwarf timing). Other types of planet 
searches (e.g., transits, microlensing, direct imaging) likely present different chal- 
lenges. Nevertheless, the challenge of estimating marginal posterior probabilities 
is quite general. Thus, we expect our finding may provide insights into methods 
for Bayesian model selection in other areas of astronomy and statistics. 
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2.1. Sampling from Prior 

Basic Monte Carlo The most obvious basic Monte Carlo (BMC) estimator of 
m(v) is based on drawing 0i from the prior and discretizing the integral in Eqn. 

01 to create the estimator rfiBMc{v) = Y^ll=iL{6) /n. Unfortunately, the very 
large parameter space makes this totally impractical, even for a single planet 
system. Using the prior in Tabled we drew over 10^ samples, but ruBMciy) 
underestimated m{6) by orders of magnitude while the internal error estimate 
suggested a random error of 2%. This is due to the fact that not a single sample 
landed in the dominant peak in the likelihood (see Fig. 1, right). 

Restricted Monte Carlo The BMC estimator can be easily modified by sam- 
pling from only a small subset of the prior. Using MCMC, we sample from the 
posterior, select one model parameter for investigation and then marginalize over 
all the remaining parameters. Then, we use the marginalized posterior distribu- 
tions to identify the subset of parameter space with non-negligible probability 
(e.g., 99.9% credible interval). Using this technique, we identified a region with 
volume, VrmC: — 2 x 10^ times less than the volume of the prior distribution, 
l^Prior- Then, we can estimate, 

mRMc{v) =^^j2m)ln, (4) 

^^Prior 

where are drawn from a distribution proportional to the prior over the re- 
stricted range of parameter values and zero elsewhere. For our test case, the 
RMC estimator provides a reasonable estimate of m{v)j^j^(j (see Fig. 2, top, 
solid curve). However, this estimator has several short comings. First, it is 
biased due to the fact that it includes the probability coming from only one 
hypercube of parameter space. If we choose a large subvolume, then the esti- 
mator converges slowly, since most samples miss the high likelihood regions. If 
we choose a small subvolume, then we may neglect a significant region of prob- 
ability outside our hypercube. These problems are exacerbated for data sets 
where there are significant correlations between parameters and/or many model 
parameters. While we were able to choose an effective subvolume for the test 
data set, the estimator converged slowly, requiring ~ 2 x 10^ samples to reach 
5% accuracy. Finally, we note that prohibitively large subvolumes can be nec- 
essary for other data sets that allow a broader range of orbital solutions and/or 
significant correlations between parameters. 

Partial Linearization & Laplace Approximation In the approximation of circu- 
lar orbits, the predicted velocity can be written as a linear function of functions 
of all the model parameter except P and s. Thus, for given values of P and 
s, there is a single global maximum of the likelihood that can be quickly lo- 
cated by solving a set of linear equations. We can analytically integrate over 
the remaining model parameters (Ki, Mj, Cj) using the Laplace approxima- 
tion by evaluating the likelihood at the global maximum (for a given P and 
s) and the Fischer informati on matrix evaluated at that point (for details see 
ICummindl2nn3 : iFordI I2nn6bl 'l . For a system with Np planets, this leaves only 
1+Np dimensions to be explored by brute force Monte Carlo, making the Monte 
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Carlo integration dramatically more efficient. For small eccentricities, the ve- 
locity perturbation due to a planet can be approximated by a series expansion 
in eccentricity. If we use the O(e^) (epicycle) approximation, then the predicted 
velocity can again be linearized over all model parameters except P and s, allow- 
ing the this technique to efficiently consider for small eccentricities. We choose 
a test case to have a modest velocity amplitude and small eccentricity, so that 
the circular and epicycle approximations provide reasonable approximations for 
the radial velocity perturbations due to HD 88133b. Thus, we expected partial 
linearization would provide at least an order of magnitude estimate of the true 
marginal posterior probability, as well as a point of reference for comparing other 
estimators. Indeed, a comparison of this estimator to the more sophisticated es- 
timators of §2.3-2.5, we find that this estimator of m{v) shows a systematic bias 
(Fig. 2, top, long-dashed curve), likely due to the Laplace and the epicycle ap- 
proximations. When the linear approximation to the velocity is not adequate, 
then a similar technique can be used, but the predicted velocity is non-linear 
in the parameters Pi, Cj, Mi, and s, so the partial linearization leaves 1 -|- 3A'p 
dimensions to be explored by brute force. While partial linearization is useful 
for analyzing systems with one planet (e.g., iFordI l2006bt ) . it rapidly becomes 
computationally impractical for multiple planet systems. 
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Figure 2. Comparison of several estimators of the marginal posterior prob- 
ability for single planet models and the 17 RV observations of HD88133 shown 
in Fig. 1. Left: Estimators of the marginal probability for a one-planet 
model. Right: Internally estimated standard deviation of each estimator. 
The line styles indicate the algorithms used for each estimator: Restricted 
Monte Carlo (top, solid), Partial Linearization (top, long-dash), Harmonic 
Mean (top, short-dash), Gelfand & Dey with Partial Linearization (top, dot- 
ted), Importance Sampling from single Normal (bottom, dotted), Importance 
Sampling from Mixture of Normals (bottom, solid) , Gelfand & Dey with Mix- 
ture of Normals (bottom, long-dash), Ratio Estimator (bottom, short-dash). 



2.2. Sampling from Posterior 

Given the serious difficulties in sampling from the prior, we proceed to consider 
estimators of the marginal posterior that sample from alternative distributions. 
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MCMC provides a computationally efficient tool for sampling from the posterior. 
Since a Bayesian analysis will typically use MCMC to generate a sample from 
the posterior for the purposes of parameter estimation, we investigate estimators 
that can use such a sample to calculate m{v). 

Harmonic Mean Newton & Raftery (1994) propose the estimator rhNR{v) = 
n/ Yl^=i^/ ^{Oi). Unfortunately, this estimator displays extremely poor con- 
vergence properties (Fig. 2, left, top, short-dash curve), as it has an infinite 
variance. Newton & Raftery (1994) suggested sampling from a mixture of the 
posterior and prior to obtain an estimator with finite variance, but we found 
similarly poor performance. 

Weighted Harmonic Mean & Partial Linearization Gelfand & Dey (1994) use 
the identity 

[m{v)]-^ = j ^ p{e\v)de, (5) 



p{o)m 

where h{9) is an arbitrary density to create an estimator for the marginal pos- 
terior probability, 

^ / /i(0*) 

i=i p{0')L{e') 

where 0* are drawn from the posterior. This estimator should perform well 
when h{9) approximates the posterior in regions of high probability, and it has 
finite variance when the tails of h{9) decay faster than the tails p{9)L{6) (see 
ICarlin LouiJ2nn n'. SB.3.n. We set /i(^) equal tope(^|i?) = p{9)Le{9)/ J d9p{9)Le{9), 
where L(>{9) is a likelihood computed as in Eqn.^ but replacing the Keplerian 
model with the epicycle approximation for computing the velocity perturbation 
of each planet and pe{9\v) is the posterior under this approximation. The nor- 
malization is calculated using partial linearization & the Laplace approximation 
as discussed in §2.1.3. Unfortunately, we find that even this approximation 
displays poor convergence in our test case (Fig. 2, left, top, dotted curve). 

2.3. Simple Importance Sampling 

Given the poor convergence properties of the previous estimators based on sam- 
pling from the prior and posterior densities, we investigate estimating m{v) via 
importance sampling. Importance sampling requires that we specify a properly 
normalized density h{9) which we can both evaluate at any 9 and sample from 
efficiently. We estimate m{9) with 

h{9^) 

where 9^, ... ,9^ are drawn from h{9). We choose h[9) = N{i!}o, S), where N is 
the multivariate normal distribution in a set of transformed model parameters, 
'&{9) (lFordll2nn6al 'l. We use the transformation 'd{9) = {logP, i^cos(a; + M), 
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K sin (uj + M), ecosw, esino;, C, logs} to help reduce non-linear correlations 
between model parameters. We use a sample from the posterior to determine the 
location, i?o, and sample covariance, S', in the transformed coordinates. We find 
that the variance of this estimator is reduced if we scale the sample covariance 
matrix by a factor ^ ~ 2 to obtain S = ^S', the covariance matrix used in the 
importance sampling density. 

In our test case, rhis,N appears to be a very robust and efficient estimator 
for this data set (Fig. 2, bottom, dotted curve). Indeed, there would be no 

need to pursue more sophisticated estimators, if fhis,h{G) performed so well 
on all data sets. However, we are concerned that this estimator will not be 
viable for other data sets where the posterior can not be well approximated by 
a single multivariate nor mal distribu tion. This is likely to occur in systems with 
long orbital periods (see lFord1l2005l . Fig. 1), small data sets, and/or when the 
velocity amplitude is small. The posterior typically is dominated by a single 
peak for most published RV data sets (almost by definition, since a data set 
that can be explained by two qualitatively different orbital solutions would not 
be considered to have discovered the planet and is unlikely to be published). 
However, if Bayesian model selection is to be used for deciding when a planet 
has been detected or as a part of Bayesian adaptive design, then it will be 
necessary to analyze data sets before the posterior is so strongly peaked that it 
can be well approximated by a single normal distribution. Therefore, we proceed 
to develop a more sophisticated importance sampler that can be more robust 
when analyzing such data sets. 

2.4. Sampling from a Mixture Density 

Basic Importance Sampling In an effort to develop an importance sampling 
density suitable for application to a general RV data set, we consider a mixture 
of multivariate normal distributions. We assume that MCMC has already been 
used to obtain a good sample (6*^, . . . , 0"') from the posterior and use this to 
construct an importance sampling density, 




where we have randomly chosen = 100 samples to be removed from the origi- 
nal posterior sample and to be used as the locations for the mixture components, 
gj{0). We choose each mixture component to be a multivariate normal distribu- 
tion, gj{6) = N{i}[9)\'&{6^),T,j), where we must determine a covariance matrix 
for each gj using the posterior sample. First, we compute p, defined to be a 

vector of the sample standard deviations for each of the components of t?, using 
the posterior sample. Next, we define the distance between the posterior sample 

9^ and the center of gj{0), dfj = J2k {^k{^^) — '^ki^-')^ I Pk-, where k indicates 

the element of •& and p. We draw another random subset of Ucv = 50nc sam- 
ples from the original posterior sample (without replacement), select the ncv/um 
posterior samples closest to each mixture component and use them to calculate 
the covariance matrix, S^, for each mixture component. We set Sj = ^S^, and 
<; = 1. Thus, we have developed an automated algorithm for using a posterior 
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sample to construct an importance sampling density, g{6)- Since the posterior 
sample is assumed to have fully explored the posterior, g{0) should be quite 
similar to the posterior in all regions of significant probability, provided that we 
use enough mixture components. 

We use g{9), Us samples from the remainder of our posterior sample, and de- 
terministic mixture sampling to compute the estimator, fhis^g{v). We find that 
it performs quite well in our test case (Fig. 2, bottom, solid curve). It converges 
nearly as rapidly as fhis,N{v) and appears somewhat more robust, even for the 
rather simple posterior in our test case. In tests on more complex data sets, 
we find that fhis,g{v) can be significantly more robust than 'fhis,N{v) for data 
sets with somewhat less well constrained posterior PDFs, but both estimators 
perform poorly on other data sets with very diffuse posterior PDFs. We spec- 
ulate that our method for choosing the mixture components could be replaced 
by a more sophisticated algorithm that might result in a superior importance 
sampling densities for challenging data sets. 

Defensive Importance Sampling Despite the success of fhfs,g, we have some 
concerns about the robustness of fnjs^g for high dimensional parameter spaces 
(e.g., analyzing systems with several planets). As the number of model pa- 
rameters increases, it will become increasingly difficult to avoid p{9)L{9) / g{9) 
becoming unusually large for some values of 9. To prevent this we generalize our 
importance sampling density to include a component from the prior, by defining 
gQ{9) = P0) and g*{9) = E]Lo 9j (0) / {nc + 1), Following l(>^en fc Zhoul (120001 
we combine this mixture density with control variables to obtain the estimator 



m 



nis,g*A-)-,^l. 9*m ho'' 



which is valid for any choice of /?. To minimize the variance of this estimator, we 
set P = (3*, where /?* is determined by least squares fitting to the linear system 
of Us equations 



j=0 




g*{9^ 



(10) 



lOwen Sz Zhoul ( 2000^ show that this estimator is never worse than an estimator 



based on any subset of the mixture components. In practice, we need Ug to be 
large to have small variance, but it is not practical to solve the linear system of 
equations with — 10^"''. Therefore, we repeatedly solve for (3 using subsets 
of rir <C Us posterior samples and average the results to estimate The 
resulting estimator fh^jj^ least as good as fhis,g{v) and is expected 
to be considerably more robust. In our test case, the estimators fhis,g{v) and 
^Dis * 0* follow each so closely that the curves would be indistinguishable 
in Fig. 2 (bottom, solid curve). This is because the values of (3j are roughly 
comparable for all components with j > 1, while /?o (the prior component) is 
orders of magnitude less than the other (3j. This refiects the fact that g{9)^ 
our mixture of Uc = 100 multivariate normal distributions, was sufficient to 



12 



Ford &: Gregory 



accurately approximate the posterior density. While fnj^^g g, ^, (v) is somewhat 
more computationally expensive, we still prefer it to mjs,g{v), since it should be 
more robust. Further, if g{6) was inadequate, then (3q would increase, alerting 
us to the potential weakness of g{0). 

Weighted Harmonic Mean We now reconsider the estimator of Gelfand & Dey 
(1994), but using the mixture, g{9), for the weight function h. The resulting 
estimator rhwHM,g{v) can also be thought of as the reciprocal of an estimator of 
g{0) / {p{9)L{B)) using importance sampling from the (unnormalized) posterior. 
When using the estimator rhwHM,giv), the denominator for each term in the 

summation contains p{9)L{9) evaluated at points sampled from the posterior, 
so the limit to the variance of this estimator will be the size (and quality) of 
the sample from the posterior. This seems acceptable, since these considerations 
will limit any estimate of the marginal likelihood based on a sample from the 
posterior. Further, this seems more attractive than algorithms which place the 
importance sampling density in the denominator, since that could result in areas 
with sparse coverage (e.g., due to high dimensionality) dominating the summa- 
tion. We show the performance of fhwHM,g{v) in Fig. 2, bottom, long-dashed 
curve. While this estimator performs reasonably well, it has a larger variance 
and appears to be converging more slowly than rhjs^g for our test case. Addi- 
tionally, We find that this estimator is particularly sensitive to the choice of ? 
and rapidly degrades if ? is too large or too small. 

2.5. Sampling from Multiple Densities 

Ratio Estimator We present a new estimator (Berger, private communication), 
based on the identity 

The key insight is to approximate the numerator by drawing a sample 9^, . . . 9^" 
from h{9) and to approximate the denominator by drawing a sample 9^, . . . 9^" 
from the posterior (e.g., via MCMC). This yields the ratio of estimators. 

This estimator seems particularly promising, since both the numerator and de- 
nominator are separate sums and there is no risk of a small denominator leading 
to a large variance, as in importance sampling. If we combine this estimator 
with the g{9), the mixture of normal distributions used in N2.4.I (again using 
a distinct subsample from the posterior for constructing g{9)), then we obtain 
the estimator fhRE,g- In our test case, the numerator converges significantly 
more rapidly than the denominator, and so we choose = n^/lO with mini- 
mal impact on the variance of the estimator. This estimator performs very well 
in our test case (Fig. 2, bottom, short-dashed curve). It converges as rapidly 



Model Selection & Exoplanets 



13 



as any of the other estimators that we considered, with the possible exception 
of importance sampUng from a single normal distribution. Further fhRE,g{v) 
appears to be quite robust, in that it does not display sudden jumps when a 
single additional sample significantly changes the value of the estimator, as is 
more common with most of the other estimators. Unfortunately, we found that 
this estimator was less accurate on more complex test cases, yet it showed no 
warning signs that the estimator had not converged, even after ^ 10^ samples. 



Parallel Tempering iGregorvl (20053) introduced the method of parallel tem- 



pering for estimating the marginal posterior probability for extrasolar planet 
observations. In parallel tempering, several Markov chains are run in parallel, 
each with a slightly different target density, TTp{6) = p{9\Ai)p{v\6 , A4)^ , where 
P is an inverse temperature parameter that varies between and 1. In the 
parallel tempering algorithm, each Markov chain typically evolves according to 
the usual candidate transition PDFs, but periodically the algorithm proposes an 
exchange of states between two Markov chains that have slightly different values 
of p. The "high temperature" Markov chains (/3 ~ 0) will explore a very broad 
region in parameter space and can help the "coldest" Markov chain (/? = 1) 
to sample from the full posterior distribution, even when there are narrow and 
widely separated peaks in the posterior distribution ( Gregorvll2005lJ ) . In princi- 



ple, the marginal posterior probability can also be calculated from the ensemble 
of Markov chains, using 



fhpTiv) = p{M) exp y dfiJl log [Pi^\kp^ ^)\ I ' (13) 



where Oi^js is the ith state in the Markov chain with target distribution vr^(0), 
and the integral over (3 is to be approximated by an appropriate weighted sum 
over the values of (3 used by the various Markov chains. 

We show the performance of fhpT{v) using 32 tempering levels for the test 
data set in Fig. 3. Based on five completely independent realizations of the 
parallel tempering algorithm, we found that four provided a reasonable estimate 
of the m{v), but one set of chains resulted in an estimate roughly twice as large 
as any other estimator tested. If this set of Markov chains had been the only 
realization, none of the usual diagnostics would have recognized that it had not 
yet converged. Therefore, we feel that more work is needed to understand the 
properties of marginal posterior estimates obtained from parallel tempering. If 
the sensitivity of the estimator were due to the slow convergence of the highest 
temperature chains (/3 ~ 0), then their contribution to the integral over /? could 
be approximated by an analytic sampler from the prior (corresponding to /3 = 0). 
However, we have verified that for our test case (see Table|2), the Markov chains 
with very small values of P make only minor contributions to the integral over 
p. The second column of Table El gives the fractional error that would result if 
this decade of P was not included and thus indicates the sensitivity of the result 
to that decade. Therefore, we suspect that the chains limiting the accuracy of 
rhpT{9) have p > W~^. 
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Figure 3. The estimate of the marginal probabihty for a one-planet model 
and the 17 RV obser vations of HD881 33 shown in Fig. 1 using the parallel 
tempering method of iGreg-orvl ^^2005a^ as a function of the number of iter- 
ations in each of the Markov chains. Each of the lines corresponds to a 
completely independent set of 32 Markov chains. Since each curves is based 
on Markov chains with 32 different values of (3, even the parallel tempering 
simulations that stopped after 320,000 iterations required roughly five times 
more likelihood evaluations than the each of the estimators shown in Fig. 2. 



Table 2. Fractional error versus /3 for the results shown in Fig. 3. 
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2.6. Other Techniques for Model Selection 

While we are encouraged by the recent progress in developing efficient and ro- 
bust estimators of the marginal posterior probability, there are several additional 
avenues of research that might lead to even more desirable estimators. One in- 
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teresting method is based on the of nested samphng methods of lskiningl(l2nn,^ . 
Unfortunately, for the problem of radial velocity planet searches, there is no ef- 
ficient way to sample only from high posterior probability regions of parameter 
space, as required by nested sampling. As a result, we can only apply nested 
sampling if we employ the very inefficient method of rejection sampling. 

Another class of algorithms for estimating Bayes factors relies on sampling 
over the model space. Two subclasses of methods have been the s ubject of muc h 
research in the statistics community: r eversible jump MCM C (lGreenl ll99,'^. 
and model composition MCMC (MC^) (fCarlin fc Chiblll995lj. Unfortunatelv 
we fi nd that the most obvious choices of pseudopriors (e.g. ICreen fc O'Hagad 
|199^ for MC^ result in very poor mixing between different models. Perhaps 
future research can adapt these methods to allow for more rapid mixing between 
models with different numbers of planets. Similarly, simplistic implementations 
of reversible jump methods seem unlikely to be practical, since the trial jumps 
into higher dimensional spaces will only land in areas of significant probability 
on extremely rare occasions. On the other hand, we are more optimistic about 
reversible jump algorithms that employ an analytic approximation to each of 
the posterior PDFs within the ith model {p{9\v, A4i)) for the transdimensional 
steps. We envision that each of the analytic approximations could be based 
on mixture models constructed from a sample from p{9\v, A4i) obtained using 
conventional MCMC techniques, similar to the importance sampling densities 
we employed for mjs,giv). 

An even more radical idea is to abandon the computation of marginal poste- 
rior probabilities in favor of some other statistic to aid in quasi-Bayesian model 
selection. Penalized likelihood methods such as the Akaike and Bayesian infor- 
mation criterion do not seem well justified when the posterior is significantly 
non-normal or multi-modal. Additionally, we are suspicious of any method that 
penalizes all parameters equally, as our model is much more sensitive to some 
model parameters (e.g., orbital period) than to others (e.g., eccentricity). There- 
fore, we are more interested in exploring methods based on the predictive distri- 
bution. Unfortunately, any of these alternative methods for model selection is 
somewhat arbitrary and less than ideal for the purposes of adaptively scheduling 
observations based on the principles of Bayesian adaptive experimental design. 



3. Conclusions 

In this paper, we have reviewed several methods for calculating the marginal 
posterior probabilities in the context of RV planet searches. One the positive 
side, we found that several algorithms were able to accurately calculate the 
marginal posterior probability for a simple test case, where there was a single 
dominant peak in the posterior probability distribution. However, all of the 
estimators based on sampling from either the prior or posterior had serious 
short comings. 

The method of partial linearization can be a useful tool for rapidly com- 
puting relatively low accuracy estimates of m{v) for data sets with one planet 
or even ~ 1 — 3 planets on low eccentricity orbits. However, it rapidly becomes 
computationally intractable when there are multiple planets with significant ec- 
centricities. Restricted Monte Carlo {fhnMciv)) can be useful for planets with 
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large eccentricities, but is computationally feasible only once the orbital param- 
eters are relatively well constrained. Parallel tempering is able to estimate m{v) 

even for multimodal posterior distributions, but for our test data set mpT{0) 
converged more slowly than all of the other algorithms tested (except basic 
Monte Carlo). For our test case, we found no regime where the harmonic mean 
{rhMR{v)) or the weighted harmonic mean {fhwHM,h{v)) would be the most de- 
sirable estimator. The new ratio estimator {ffiREiv)) performed very well for 
our test case, but wc recommend proceeding with caution, based on preliminary 
tests with more complex data sets. 

Based on our tests, the most promising methods are based on importance 
sampling using an analytic density that mimics the posterior (e.g., mjs,N{v) 
or "rnj-fjg g, 0t{v))- When the posterior has a single dominant peak that can 
be reasonably approximated by a multivariate normal distribution, then sim- 
ple importance sampling (fhis^Niv)) provides a very efficient tool for estimating 
marginal posteriors. In cases where the posterior is more complex (e.g., multiple 
peaks and/or non- linear parameter correlations), then importance sampling can 
still be useful when combined with a mixture distribution based on a sample from 
the posterior that can be readily calculated via standard MCMC. Refinements 
to the basic importance sampling algorithm (e.g., f^^pus g* ^^"^ provide 

increased robustness and offer a tool for diagnosing when the mixture distribu- 
tion is sufficient. We hope that future research will improve our understanding 
of these estimator's theoretical and real-life properties, as well as lead to ad- 
ditional refinements. In particular, we hope to investigate how the estimator 
^Dis g* 0* performs on more widely dispersed posterior distributions and on 
higher dimensional problems (e.g., multiple planet systems). 

In a sense, we can consider the problem of Bayesian model selection to have 
been reduced to the problem of constructing an analytic approximation to a 
probability density based only on a set of samples from the distribution. Un- 
fortunately, we recognize our algorithm for constructing importance sampling 
densities is not yet sufficiently robust to be applied generally. Therefore, we 
would like to sec additional research that would improve the robustness and 
computational efficiency of these algorithms. Fortunately, we recognize several 
ways our current algorithm could be improved. For example, rather than cen- 
tering the mixture components on random samples from the posterior, it might 
be possible to make do with a smaller number of mixture components. Perhaps 
methods making use of Voronoi tessellations and/or quasi-Monte Carlo meth- 
ods could be beneficial in constructing good mixture distributions with fewer 
components, improving the computational efficiency of these methods. 

Finally, we note that this field is still young, and additional research is 
needed to explore a wide range of methods for estimating m{9), including meth- 
ods on importance sampling, parallel tempering, reversible jump MCMC, MC^, 
and nested sampling. 
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